1 Introduction



Manipulating probability distributions is central to data compression, so it
is natural to ask how well we can compress probability distributions
themselves. For example, this is useful for probabilistic reasoning [2] and
query optimization [6]. Our interest in it stems from designing single-round
asymmetric communication protocols [1,4]. Suppose a server with high bandwidth
wants to help a client with low bandwidth send it a message; the server knows
the distribution from which the message is drawn but the client does not. If the
distribution compresses well, then the server can just send that; conversely, if
the server can help the client in just one round of communication and without
sending too many bits, then the distribution compresses well, since we can view
the server's transmission as encoding it.

Compressing probability distributions must be lossy, in general, and it is not
always obvious how to measure fidelity. In this paper we measure fidelity using
relative entropy because, in the asymmetric communication example above,
the relative entropy is roughly how many more bits we expect the client to send
with the server's help than if it knew the distribution itself. Let
$P = p_1, \ldots, p_n$ and $Q = q_1, \ldots, q_n$ be probability distributions
over the same set. Then the relative entropy [9] of $P$ with respect to $Q$ is
defined as

$$D(P\|Q) = \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i} .$$

By log we mean $\log_2$. Despite sometimes being called Kullback-Leibler
distance, relative entropy is not a true distance metric: it is not symmetric
and does not satisfy the triangle inequality. However, it is widely used in
mathematics, physics and computer science as a measure of how well $Q$
approximates $P$ [3].
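
To make the definition concrete, the following is a minimal Python sketch of the relative entropy in bits. The function name `relative_entropy` and the treatment of zero probabilities (a term with $p_i = 0$ contributes nothing, and a term with $q_i = 0 < p_i$ makes the result infinite) are conventions assumed here for illustration, not taken from the text.

```python
import math

def relative_entropy(p, q):
    """Return D(P||Q) in bits for two distributions over the same set."""
    if len(p) != len(q):
        raise ValueError("P and Q must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue  # assumed convention: 0 * log(0 / q_i) = 0
        if q_i == 0:
            return math.inf  # Q gives zero mass where P does not
        total += p_i * math.log2(p_i / q_i)
    return total

# Small example showing that the measure is not symmetric.
P = [0.5, 0.5]
Q = [0.25, 0.75]
forward = relative_entropy(P, Q)   # D(P||Q)
backward = relative_entropy(Q, P)  # D(Q||P)
```

Here `forward` and `backward` differ, which illustrates why relative entropy is not a true distance metric.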

We consider probability distributions simply as sequences of non-negative
numbers that sum to 1; that is, we do not consider how to store the sample
space. In Section 2 we show how, given a probability distribution
$P = p_1, \ldots, p_n$, we can construct a probability distribution
$Q = q_1, \ldots, q_n$ with $D(P\|Q) < 2$ and store $Q$ exactly in $2n - 2$
bits of space. Constructing, storing and recovering $Q$ each take $O(n)$ time.
We also show how to trade compression for fidelity and vice versa. Finally, in
Section 3, we show how to store a compressed probability distribution and
query individual probabilities without decompressing it.

2 An Algorithm for Compressing Probability Distributions

The simplest way to compress a probability distribution $P$ is to construct and
store a Huffman tree [5] for it. This lets us recover a probability distribution
$Q$ with $D(P\|Q) < 1$ [10,14] but takes both $\Omega(n \log n)$ time and
$\Omega(n \log n)$ bits of space. In this section, we show how to use the
following theorem, due to Mehlhorn [11], to compress $P$ by representing it as
a strict ordered binary tree. A strict ordered binary tree is one in which each
node is either a leaf or has both a left child and a right child. We show how
to trade compression for fidelity, by applying this result repeatedly, or trade
fidelity for compression, using another approach.

Theorem 1 (Mehlhorn, 1977) Given a probability distribution
$P = p_1, \ldots, p_n$, we can construct a strict ordered binary tree on $n$
leaves that, from left to right, have depths less than
$\log(1/p_1) + 2, \ldots, \log(1/p_n) + 2$. This takes $O(n)$ time.

PROOF SKETCH. For $1 \le i \le n$, let

$$S_i = \frac{p_i}{2} + \sum_{j=1}^{i-1} p_j .$$

Consider the code in which the $i$th codeword is the first
$\lceil \log(2/p_i) \rceil$ bits of the binary expansion of $S_i$; these bits
suffice to distinguish $S_i$, so the code is prefix-free. Notice the $i$th leaf
of the corresponding code-tree has depth less than $\log(1/p_i) + 2$. □
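
The codewords in the proof sketch can be computed directly. The following Python sketch is only an illustration: it assumes every $p_i$ is positive and given as an exact `Fraction`, the name `mehlhorn_codewords` is hypothetical, and it uses a simple loop to find each codeword length, so it does not attempt the $O(n)$ bound. It also stops at the code itself; the resulting code-tree can contain nodes with a single child, and turning it into a strict tree is not shown.

```python
import math
from fractions import Fraction

def mehlhorn_codewords(p):
    """Return the i-th codeword for each p_i, as strings of '0' and '1'."""
    prefix = Fraction(0)  # p_1 + ... + p_{i-1}
    words = []
    for p_i in p:
        p_i = Fraction(p_i)
        if p_i <= 0:
            raise ValueError("this sketch assumes positive probabilities")
        s_i = prefix + p_i / 2
        # Smallest length with 2**length >= 2 / p_i, i.e. ceil(log2(2 / p_i)).
        length = 1
        while p_i * 2 ** length < 2:
            length += 1
        # First `length` bits of the binary expansion of s_i (0 < s_i < 1).
        bits = math.floor(s_i * 2 ** length)
        words.append(format(bits, f"0{length}b"))
        prefix += p_i
    return words

words = mehlhorn_codewords([Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)])
```

For this input the codewords have lengths 2, 3 and 3, each less than $\log(1/p_i) + 2$, and no codeword is a prefix of another.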
Once we have used Theorem 1 to get a strict ordered binary tree $T$, we store
$T$. It is important that $T$ be ordered; otherwise, it would only store
information about the multiset $\{p_i : 1 \le i \le n\}$, rather than the
sequence $P$, so we would also need a permutation on $n$ elements, which takes
$\Theta(n \log n)$ bits.

Theorem 2 Given a probability distribution $P = p_1, \ldots, p_n$, we can
construct a probability distribution $Q = q_1, \ldots, q_n$ with
$\max_{1 \le i \le n} \{p_i / q_i\} < 4$, so $D(P\|Q) < 2$, and store $Q$
exactly in $2n - 2$ bits of space. Constructing, storing and recovering $Q$
each take $O(n)$ time.

PROOF. We apply Theorem 1 to $P$ to get a strict ordered binary tree $T$ on $n$
leaves that, from left to right, have depths $d_1, \ldots, d_n$ with
$d_i < \log(1/p_i) + 2$. We store $T$ in $2n - 2$ bits of space, represented as
a sequence of balanced parentheses.

For $1 \le i \le n$, let

$$q_i = \frac{2^{-d_i}}{\sum_{j=1}^{n} 2^{-d_j}} .$$

Since $T$ is strict, by the Kraft Inequality [8],

$$\sum_{j=1}^{n} 2^{-d_j} = 1 ;$$

thus, $q_i = 2^{-d_i} > p_i / 4$. □
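
The storage and recovery steps can be sketched in Python as follows. The text does not say which balanced-parentheses encoding is meant; this sketch assumes one standard choice, in which a leaf is encoded as the empty string and an internal node with subtrees $L$ and $R$ as "(" + enc($L$) + ")" + enc($R$), so that a strict tree on $n$ leaves uses exactly $2n - 2$ parentheses. The names `encode_tree`, `leaf_depths` and `recover_q`, and the representation of trees as nested tuples with `None` for a leaf, are assumptions made for illustration.

```python
from fractions import Fraction

def encode_tree(tree):
    """Encode a strict ordered binary tree; None is a leaf, (l, r) a node."""
    if tree is None:
        return ""
    left, right = tree
    return "(" + encode_tree(left) + ")" + encode_tree(right)

def leaf_depths(parens):
    """Recover the leaf depths, from left to right, in one pass."""
    depths = []
    pending = []  # depths of right children still to be visited
    depth = 0     # depth of the node currently being read
    for symbol in parens:
        if symbol == "(":
            # Current node is internal: remember its right child, go left.
            pending.append(depth + 1)
            depth += 1
        else:
            # Current node is a leaf ending a left subtree; go to the
            # right child of the matching internal node.
            depths.append(depth)
            depth = pending.pop()
    depths.append(depth)  # the rightmost leaf
    return depths

def recover_q(parens):
    """Return Q with q_i = 2 ** (-d_i), as exact fractions."""
    return [Fraction(1, 2 ** d) for d in leaf_depths(parens)]

tree = ((None, None), None)  # leaves at depths 2, 2, 1
stored = encode_tree(tree)   # 2n - 2 = 4 parentheses for n = 3
q = recover_q(stored)
```

Because the tree is strict, the recovered values sum to 1 without any further normalization.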
Using Theorem 2 as a starting point, we can improve fidelity at the cost of
using more space. One approach is given below; we leave as future work finding 
better tradeoffs. 

Theorem 3 Given a probability distribution $P = p_1, \ldots, p_n$ and an
integer $k \ge 2$, we can construct a probability distribution
$Q = q_1, \ldots, q_n$ with $\max_{1 \le i \le n} \{p_i / q_i\} < 2 + 1/2^{k-3}$,
so $D(P\|Q) < \log(2 + 1/2^{k-3})$, and store $Q$ exactly in $kn - 2$ bits of
space. Constructing, storing and recovering $Q$ each take $O(kn)$ time.



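PROOF. By induction on $k$. By Theorem 2, the claim is true for $k = 2$. Let
$k \ge 3$ and assume the claim is true for $k - 1$.

Let $Q' = q'_1, \ldots, q'_n$ be the probability distribution we construct when
given $P$ and $k - 1$. Let $B = b_1 \cdots b_n$ be the binary string with
$b_i = 1$ if $p_i / q'_i \ge 1 + 1/2^{k-3}$, and $b_i = 0$ otherwise. For
$1 \le i \le n$, let

$$q_i = \begin{cases}
\dfrac{2 q'_i}{\sum_{b_j = 1} 2 q'_j + \sum_{b_j = 0} q'_j} & \text{if } b_i = 1, \text{ and} \\[2ex]
\dfrac{q'_i}{\sum_{b_j = 1} 2 q'_j + \sum_{b_j = 0} q'_j} & \text{if } b_i = 0.
\end{cases}$$

Notice we can store $Q$ exactly in $kn - 2$ bits of space, using $(k - 1)n - 2$
bits for $Q'$ and $n$ bits for $B$. Also,

$$\sum_{b_j = 1} 2 q'_j + \sum_{b_j = 0} q'_j
= \sum_{i=1}^{n} q'_i + \sum_{b_j = 1} q'_j
\le 1 + \sum_{b_j = 1} \frac{p_j}{1 + 1/2^{k-3}}
\le \frac{2^{k-2} + 1}{2^{k-3} + 1} .$$

If $b_i = 1$, then $q_i \ge 2 q'_i \cdot \frac{2^{k-3} + 1}{2^{k-2} + 1}$. Since,
by assumption, $q'_i > \frac{p_i}{2 + 1/2^{k-4}}$, we have

$$q_i > \frac{2 p_i}{2 + 1/2^{k-4}} \cdot \frac{2^{k-3} + 1}{2^{k-2} + 1}$$

and so $p_i / q_i < 2 + 1/2^{k-3}$. If $b_i = 0$, then
$q'_i > \frac{p_i}{1 + 1/2^{k-3}}$. Since
$q_i \ge q'_i \cdot \frac{2^{k-3} + 1}{2^{k-2} + 1}$, we have

$$q_i > \frac{p_i}{1 + 1/2^{k-3}} \cdot \frac{2^{k-3} + 1}{2^{k-2} + 1}$$

and so $p_i / q_i < 2 + 1/2^{k-3}$.

By assumption, constructing $Q'$ takes $O((k - 1)n)$ time and constructing $B$
and $Q$ from $Q'$ takes $O(n)$ time. Thus, constructing, storing and recovering
$Q$ each take $O(kn)$ time. □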
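
One induction step of this proof can be sketched in Python as follows. This is an illustration under stated assumptions rather than a definitive implementation: the name `refine` is hypothetical, probabilities are exact `Fraction` values, the input `q_prev` is assumed to already satisfy the bound for $k - 1$, and the new distribution is obtained by doubling the flagged entries of $Q'$ and renormalizing.

```python
from fractions import Fraction

def refine(p, q_prev, k):
    """Return (B, Q) for one induction step, given P, Q' and k >= 3."""
    threshold = 1 + Fraction(1, 2 ** (k - 3))
    # b_i = 1 exactly when Q' underestimates p_i by at least the threshold.
    bits = [1 if Fraction(p_i) / q_i >= threshold else 0
            for p_i, q_i in zip(p, q_prev)]
    # Double the flagged entries, then renormalize so Q sums to 1.
    weights = [2 * q_i if b else q_i for b, q_i in zip(bits, q_prev)]
    total = sum(weights)
    return bits, [w / total for w in weights]

p = [Fraction(7, 10), Fraction(1, 5), Fraction(1, 10)]
q_prev = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]  # max p_i/q'_i < 4
bits, q = refine(p, q_prev, 3)
```

In this example only the first entry is flagged, and the largest ratio $p_i / q_i$ drops from $14/5$ to $7/4$, below the bound of 3 for $k = 3$.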



It may be possible to strengthen Theorem 2 using results about alphabetic
Huffman codes (e.g., [13]). We base it on Theorem 1 for two reasons: Mehlhorn's
construction takes $O(n)$ time, whereas known algorithms for constructing
alphabetic Huffman codes take $\Omega(n \log n)$ time [7], and the guarantee
that $\max_{1 \le i \le n} \{p_i / q_i\} < 4$ makes the proof of Theorem 3
cleaner.

Using a different approach, we can also reduce the space used at the cost of 
reducing fidelity. 

Theorem 4 Given a probability distribution $P = p_1, \ldots, p_n$ and
$c \ge 1$, we can construct a probability distribution $Q = q_1, \ldots, q_n$
with $D(P\|Q) < c \cdot H(P) + \log(\pi^2/3)$ and store $Q$ exactly in at most
$\lfloor n^{1/(c+1)} \rfloor (\lfloor \log n \rfloor + 1)$ bits of space.
Constructing, storing and recovering $Q$ each take $O(n)$ time.



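PROOF. Let $t \le \lfloor n^{1/(c+1)} \rfloor$ be the number of probabilities
in $P$ that are at least $1/n^{1/(c+1)}$. Let $r_1, \ldots, r_t$ be such that
$p_{r_j}$ is the $j$th largest probability in $P$, and let
$R = \{r_1, \ldots, r_t\}$. Thus,

$$p_{r_1} \ge \cdots \ge p_{r_t} \ge \frac{1}{n^{1/(c+1)}} > \max_{i \notin R} \{p_i\} .$$

Computing the set $R$ takes $O(n)$ time and sorting it takes
$O(n^{1/(c+1)} \log n) \subseteq O(n)$ time. For $1 \le j \le t$, let
$q_{r_j} = 3/(\pi j)^2$; since

$$\sum_{j=1}^{\infty} \frac{1}{j^2} = \frac{\pi^2}{6} ,$$

we have

$$\sum_{j=1}^{t} q_{r_j} < \frac{1}{2} .$$

For $i \notin R$, let

$$q_i = \frac{1 - \sum_{j=1}^{t} q_{r_j}}{n - t} > \frac{1}{2n} .$$

Storing $Q$ as the binary representations of $r_1, \ldots, r_t$ takes at most
$\lfloor n^{1/(c+1)} \rfloor (\lfloor \log n \rfloor + 1)$ bits of space and
$O(n)$ time.

For $1 \le j \le t$, since $p_{r_j}$ is the $j$th largest probability in $P$,
we have $p_{r_j} \le 1/j$. Therefore,

$$\begin{aligned}
H(P) &= \sum_{i=1}^{n} p_i \log \frac{1}{p_i} \\
&\ge \sum_{j=1}^{t} p_{r_j} \log j + \sum_{i \notin R} p_i \log n^{1/(c+1)} \\
&= \sum_{j=1}^{t} p_{r_j} \log j + \frac{1}{c+1} \sum_{i \notin R} p_i \log n .
\end{aligned}$$

Compare this with

$$\begin{aligned}
D(P\|Q) &= \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i} \\
&= \sum_{j=1}^{t} p_{r_j} \log \frac{1}{q_{r_j}} + \sum_{i \notin R} p_i \log \frac{1}{q_i} - H(P) \\
&< \sum_{j=1}^{t} p_{r_j} \log \frac{(\pi j)^2}{3} + \sum_{i \notin R} p_i \log(2n) - H(P) \\
&= 2 \sum_{j=1}^{t} p_{r_j} \log j + \sum_{i \notin R} p_i \log n + \log\left(\frac{\pi^2}{3}\right) \sum_{j=1}^{t} p_{r_j} + \sum_{i \notin R} p_i - H(P) \\
&\le 2 \sum_{j=1}^{t} p_{r_j} \log j + \sum_{i \notin R} p_i \log n + \log \frac{\pi^2}{3} - H(P) .
\end{aligned}$$

Since $c \ge 1$,

$$\frac{D(P\|Q) - \log(\pi^2/3)}{H(P)}
< \frac{2 \sum_{j=1}^{t} p_{r_j} \log j + \sum_{i \notin R} p_i \log n}
{\sum_{j=1}^{t} p_{r_j} \log j + \frac{1}{c+1} \sum_{i \notin R} p_i \log n} - 1
\le c ;$$

that is, $D(P\|Q) < c \cdot H(P) + \log(\pi^2/3)$. □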
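
The construction in this proof can be sketched in Python as follows. The names `heavy_indices` and `expand` are hypothetical, indices are 0-based, and the sketch uses floating-point arithmetic and a full sort of the heavy indices. It assumes $n \ge 2$, so that at least one index lies outside $R$.

```python
import math

def heavy_indices(p, c):
    """Return r_1, ..., r_t (0-based): indices with p_i >= n ** (-1/(c+1)),
    ordered by decreasing probability. This list is all that is stored."""
    n = len(p)
    threshold = n ** (-1.0 / (c + 1))
    heavy = [i for i in range(n) if p[i] >= threshold]
    heavy.sort(key=lambda i: -p[i])
    return heavy

def expand(ranks, n):
    """Recover Q from the stored indices r_1, ..., r_t and n."""
    q = [0.0] * n
    mass = 0.0
    for j, r in enumerate(ranks, start=1):
        q[r] = 3.0 / (math.pi * j) ** 2  # depends only on the rank j
        mass += q[r]
    light = n - len(ranks)
    share = (1.0 - mass) / light if light else 0.0
    in_r = set(ranks)
    for i in range(n):
        if i not in in_r:
            q[i] = share  # remaining mass is spread evenly
    return q

p = [0.05] * 4 + [0.6] + [0.05] * 4
ranks = heavy_indices(p, 1)
q = expand(ranks, len(p))
```

With $c = 1$ and $n = 9$ the threshold is $1/3$, so only the probability 0.6 is recorded, and every other index receives an equal share of the remaining mass.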



If space is at a premium, we may need to work with $Q$ without decompressing
it. Notice we can do this by storing $\{(r_1, 1), \ldots, (r_t, t)\}$ in order
by first component, which takes at most

$$\lfloor n^{1/(c+1)} \rfloor \left( \lfloor \log n \rfloor + \left\lfloor \frac{\log n}{c+1} \right\rfloor + 2 \right)$$

bits of space. Given $i$ between 1 and $n$, we can compute

$$q_i = \begin{cases}
3/(\pi j)^2 & \text{if } i = r_j, \\[1ex]
\dfrac{1 - \sum_{j=1}^{t} 3/(\pi j)^2}{n - t} & \text{if } i \notin \{r_1, \ldots, r_t\}
\end{cases}$$

in $O(\log t) \subseteq O(\log(n)/c)$ time.
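
A lookup of this kind can be sketched in Python with a binary search over the sorted pairs. The names `build_index` and `query` are hypothetical and indices are 0-based. As an additional assumption not stated in the text, the sketch computes the common value for indices outside $\{r_1, \ldots, r_t\}$ once, when the index is built, so that each query is dominated by the search.

```python
import bisect
import math

def build_index(ranks, n):
    """ranks[j - 1] is r_j (0-based index of the j-th largest probability)."""
    pairs = sorted((r, j) for j, r in enumerate(ranks, start=1))
    t = len(ranks)
    mass = sum(3.0 / (math.pi * j) ** 2 for j in range(1, t + 1))
    share = (1.0 - mass) / (n - t) if n > t else 0.0
    return pairs, share

def query(index, i):
    """Return q_i using a binary search on the first components."""
    pairs, share = index
    # (i, 0) sorts just before (i, j) for any rank j >= 1.
    pos = bisect.bisect_left(pairs, (i, 0))
    if pos < len(pairs) and pairs[pos][0] == i:
        j = pairs[pos][1]
        return 3.0 / (math.pi * j) ** 2
    return share

index = build_index([7, 2, 5], 10)  # r_1 = 7, r_2 = 2, r_3 = 5
largest = query(index, 7)
```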

3 A Data Structure for Compressed Probability Distributions 

In this section, we show how to work with a probability distribution
compressed with Theorem 2 without decompressing it, using a succinct data
structure due to Munro and Raman [12]. This data structure stores a strict
ordered binary tree on $n$ leaves in $2n + o(n)$ bits of space and supports
queries that, given a node, return its parent, left child, right child and
number of descendants. Each of these queries takes $O(1)$ time. Notice that,
given $i$ between 1 and $n$, we can find the depth $d$ of the $i$th leaf in
$O(d)$ time, by walking down from the root and using the descendant counts to
decide at each node which child's subtree contains that leaf.
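
The descent to the $i$th leaf can be sketched in Python as follows. This is not the succinct data structure of Munro and Raman: the sketch assumes an ordinary pointer-based tree in which each node caches the number of leaves below it (a stand-in for the descendant-count query), and the names `Node` and `leaf_depth` are hypothetical.

```python
class Node:
    """Node of a strict ordered binary tree; a leaf has no children."""

    def __init__(self, left=None, right=None):
        self.left = left
        self.right = right
        # Number of leaves in this subtree (stand-in for descendant counts).
        self.leaves = 1 if left is None else left.leaves + right.leaves

def leaf_depth(root, i):
    """Return the depth of the i-th leaf (1-based, left to right)."""
    if not 1 <= i <= root.leaves:
        raise IndexError("leaf index out of range")
    node, depth = root, 0
    while node.left is not None:
        if i <= node.left.leaves:
            node = node.left          # the leaf lies in the left subtree
        else:
            i -= node.left.leaves     # skip the leaves of the left subtree
            node = node.right
        depth += 1
    return depth

root = Node(Node(Node(), Node()), Node())  # leaves at depths 2, 2, 1
depths = [leaf_depth(root, i) for i in (1, 2, 3)]
```

Each step of the loop moves one level down, so the walk to a leaf at depth $d$ takes $d$ steps, and $q_i = 2^{-d}$ can then be returned directly.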

Theorem 5 Given a probability distribution $P = p_1, \ldots, p_n$, we can
construct a data structure that uses $2n + o(n)$ bits of space and supports a
query that, given $i$ between 1 and $n$, returns $q_i$ in $O(\log(1/q_i))$
time. Here, $Q = q_1, \ldots, q_n$ is a probability distribution with
$\max_{1 \le i \le n} \{p_i / q_i\} < 4$, so $D(P\|Q) < 2$ and
$O(\log(1/q_i)) \subseteq O(\log(1/p_i))$.



PROOF. As for Theorem 2, but with the sequence of balanced parentheses 
replaced by an instance of Munro and Raman's data structure. □ 

A drawback to Theorem 5 is that querying a very small probability might take
$\Omega(n)$ time. We can fix this by smoothing the given probability
distribution slightly.

Theorem 6 Given a probability distribution $P = p_1, \ldots, p_n$ and
$\epsilon > 0$, we can construct a data structure that uses $2n + o(n)$ bits of
space and supports a query that, given $i$ between 1 and $n$, returns $q_i$ in
$O(\log(1/q_i))$ time. Here, $Q = q_1, \ldots, q_n$ is a probability
distribution with $D(P\|Q) < 2 + \epsilon$ and
$\log(1/q_i) \in O(\log \min(1/p_i, n/\epsilon))$.



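PROOF. Let $P' = p'_1, \ldots, p'_n$, where

$$p'_i = \frac{p_i}{1 + \epsilon/4} + \frac{\epsilon/4}{(1 + \epsilon/4) n} .$$

We apply Theorem 5 to $P'$; let $Q$ be the stored distribution. Notice

$$\max_{1 \le i \le n} \left\{ \frac{p_i}{q_i} \right\}
\le \max_{1 \le i \le n} \left\{ \frac{p_i}{p'_i} \right\} \cdot \max_{1 \le i \le n} \left\{ \frac{p'_i}{q_i} \right\}
< \left( 1 + \frac{\epsilon}{4} \right) \cdot 4 = 4 + \epsilon .$$

Since log is concave, $\log(4 + \epsilon) = 2 + \log(1 + \epsilon/4) < 2 + \epsilon$,
and it follows that $D(P\|Q) < 2 + \epsilon$. Since $q_i > p'_i / 4$,

$$q_i > \max \left( \frac{p_i}{4 + \epsilon}, \frac{\epsilon}{4 (4 + \epsilon) n} \right) ,$$

so we have $\log(1/q_i) \in O(\log \min(1/p_i, n/\epsilon))$. □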
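
The smoothing step can be sketched in Python as follows. The name `smooth` is hypothetical and exact `Fraction` arithmetic is assumed, so that the properties used in the proof can be checked without rounding error.

```python
from fractions import Fraction

def smooth(p, eps):
    """Return P', a mixture of P with the uniform distribution."""
    n = len(p)
    eps = Fraction(eps)
    scale = 1 + eps / 4
    # p'_i = p_i / (1 + eps/4) + (eps/4) / ((1 + eps/4) * n)
    return [Fraction(p_i) / scale + (eps / 4) / (scale * n) for p_i in p]

p = [Fraction(1), Fraction(0)]
p_smooth = smooth(p, 1)
```

In this extreme example the zero probability is raised to $1/10$, while the ratio $p_i / p'_i$ stays at most $1 + \epsilon/4 = 5/4$.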